Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645

hantangwangd · 2024-09-13T04:15:59Z

Description

Currently, when querying Iceberg tables, we always eagerly load all partition values of the table in PickTableLayout or IcebergFilterPushdown during the optimization phase. Due to the fact that this eagerly loaded partition values are currently only used in metadata based optimization rules, in many cases this information is not used at all. It can result in a lot of waste of resources and performance in the following cases:

For queries that cannot be optimized based on metadata, we do not use these partition values at all, so we shouldn't load them eagerly.
For tables with a huge number of partitions that are not suitable for metadata optimization, we need to limit the max number that can be loaded in the loading phase, rather than loading all of them first and then determine whether they exceed the threshold.

This PR makes the loading behavior of partitions lazy and support setting a threshold for the maximum number of partitions that can be loaded during the loading phase. In this way, we can avoid a lot of unnecessary loading in many scenarios, as well as the resulting resource consumption and performance loss.

The benchmark's results also support the above conclusion. We execute regular query statements, query statements which are applicable for metadata optimization, and query statements which are applicable for further reducible optimization on tables with two different partition numbers. Among them, 300 * 3 will make the table contain 900 partitions (not reaching the default threshold 1000), while 400 * 4 will make the table contain 1600 partitions (exceeding the default threshold of 1000). The code can be viewed here: https://github.com/hantangwangd/presto/blob/benchmark_for_lazy_load/presto-iceberg/src/test/java/com/facebook/presto/iceberg/BenchmarkIcebergQuery.java

The benchmark test result before this change is as follows:


----Before this change, always load partitions eagerly----

Benchmark                                  (recordCount)  Mode  Cnt     Score     Error  Units
BenchmarkIcebergQuery.testFurtherOptimize        300 * 3  avgt   10   154.888 ±   2.762  ms/op
BenchmarkIcebergQuery.testFurtherOptimize        400 * 4  avgt   10   201.214 ±   5.199  ms/op
BenchmarkIcebergQuery.testNormalQuery            300 * 3  avgt   10   673.455 ±  26.239  ms/op
BenchmarkIcebergQuery.testNormalQuery            400 * 4  avgt   10   907.111 ± 103.223  ms/op
BenchmarkIcebergQuery.testOptimize               300 * 3  avgt   10   252.138 ±   6.873  ms/op
BenchmarkIcebergQuery.testOptimize               400 * 4  avgt   10  1025.587 ±  78.809  ms/op

While the benchmark test result after this change is as follows:


----After this change, lazy load partitions and check the max threshold in loading phase----

Benchmark                                  (recordCount)  Mode  Cnt    Score    Error  Units
BenchmarkIcebergQuery.testFurtherOptimize        300 * 3  avgt   10  164.981 ±  3.552  ms/op
BenchmarkIcebergQuery.testFurtherOptimize        400 * 4  avgt   10  211.434 ±  4.327  ms/op
BenchmarkIcebergQuery.testNormalQuery            300 * 3  avgt   10  437.028 ± 49.367  ms/op
BenchmarkIcebergQuery.testNormalQuery            400 * 4  avgt   10  634.008 ± 54.919  ms/op
BenchmarkIcebergQuery.testOptimize               300 * 3  avgt   10  257.394 ±  7.223  ms/op
BenchmarkIcebergQuery.testOptimize               400 * 4  avgt   10  808.018 ± 36.952  ms/op

Due to the issues mentioned above, we found that this change significantly improves the performance of queries that are not suitable for metadata optimization, while for statements that can be optimized based on metadata, the performance also improves when the number of partitions exceeds the threshold. This is in line with expectations, and it can be anticipated that as the number of partitions increases, the performance improvement will further increase.

Motivation and Context

Make partitions loading for iceberg table lazy to avoid unnecessary loading

Impact

N/A

Test Plan

Make sure the change do not affect existing tests
Newly added test case in TestIcebergLogicalPlanner to show the behaviors with different max partition thresholds
Benchmark tests in local to make sure the improvement in performance

Contributor checklist

Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== RELEASE NOTES ==


Iceberg Connector Changes
* Make partitions loading lazy for Iceberg table to avoid unnecessary loading. :pr:`23645`

hantangwangd added 2 commits September 14, 2024 19:43

[Iceberg]Make partitions loading lazy to avoid unnecessary loading

8a26146

[Iceberg]Make the max threshold of loaded partitions configurable

e390db7

hantangwangd force-pushed the make_partition_loading_lazy branch from 9028384 to e390db7 Compare September 14, 2024 12:11

hantangwangd marked this pull request as ready for review September 15, 2024 03:31

hantangwangd requested review from ZacBlanco and a team as code owners September 15, 2024 03:31

hantangwangd requested review from presto-oss, imjalpreet, aaneja and tdcmeehan and removed request for presto-oss September 15, 2024 03:31

hantangwangd changed the title ~~[WIP]Make partitions loading for iceberg table lazy to avoid unnecessary loading~~ Make partitions loading for iceberg table lazy to avoid unnecessary loading Sep 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645

Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645

hantangwangd commented Sep 13, 2024 •

edited

Loading

Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645

Are you sure you want to change the base?

Make partitions loading for iceberg table lazy to avoid unnecessary loading #23645

Conversation

hantangwangd commented Sep 13, 2024 • edited Loading

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

hantangwangd commented Sep 13, 2024 •

edited

Loading